# Replication Package for "Price Discovery on Decentralized Exchanges"

## Data and Code Availability Statement

The replication package contains the code and data necessary to reproduce the empirical analysis in the paper "Price Discovery on Decentralized Exchanges". It complies with the Review of Financial Studies Code and Data Sharing Policy and follows the Data and Code Availability Standard (DCAS) v1.0.

### Data Sources and Access

**Primary Data Sources:**
- **Amberdata**: Blockchain transaction data, mempool orders, DEX trade data, and CEX orderbook data
  - Access: Subscription

- **Binance**: CEX trade data
  - Access: Public API (https://binance-docs.github.io/apidocs/)

### Reproduction Requirements

The package reproduces all published results from the provided data and code. The analysis runs on any machine that meets the software and system requirements listed under "Software Requirements and Dependencies".

## Directory Structure

The package is distributed as tar.gz archives. Uncompress all archives before executing the scripts.

```
replication/
├── code/                    # Python analysis scripts
├── input/                   # Raw data files
├── output/                  # Final results and tables
├── temp/                    # Intermediate processed data
└── README.md                # README file
```
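
The archives can be unpacked from Python with the standard-library `tarfile` module. The sketch below is illustrative only: the function name `extract_archives` and the assumption that all `*.tar.gz` archives sit in a single directory are not part of the package.

```python
import tarfile
from pathlib import Path


def extract_archives(archive_dir, dest_dir):
    """Extract every *.tar.gz archive found in archive_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives(".", "replication")` would unpack all archives in the current directory into `replication/`. Only extract archives from a trusted source.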

## Software Requirements and Dependencies

### Python Environment
- **Python**: Version 3.8.0 or higher
- **Required packages** (with minimum versions):
  - pandas >= 1.3.0
  - numpy >= 1.21.0
  - scipy >= 1.7.0
  - matplotlib >= 3.4.0
  - seaborn >= 0.11.0
  - statsmodels >= 0.12.0
  - rpy2 >= 3.4.0 (for R integration)
  - multiprocessing (built-in)
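
A quick way to confirm that an environment satisfies the minimum versions above is to query the installed package metadata. This is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are illustrative and not part of the package, and the version parsing is deliberately simple (it ignores pre-release suffixes).

```python
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "pandas": "1.3.0",
    "numpy": "1.21.0",
    "scipy": "1.7.0",
    "matplotlib": "3.4.0",
    "seaborn": "0.11.0",
    "statsmodels": "0.12.0",
    "rpy2": "3.4.0",
}


def version_tuple(version):
    """Turn '1.21.0' into (1, 21, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUM_VERSIONS):
    """Return {package: problem} for missing or outdated packages."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems


print(check_environment() or "All listed packages meet the minimum versions.")
```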

### R Environment
- **R**: Version 4.0.0 or higher
- **Required packages**:
  - vars >= 1.5.0 (for structural VAR estimation)
  - Additional packages installed via `install-r-packages.py`

### System Requirements
- **Processor**: Multi-core processor 
- **Memory**: Minimum 16GB RAM 
- **Storage**: Minimum 160GB free disk space for compressed data files
- **Operating System**: macOS or Linux

## Data Description

### Input Data Structure

The `input/` directory contains the following data sources:

#### Trading Data
- **`trades-aggregated-dex/`**: DEX trade data from Uniswap V2
  - **Format**: Daily compressed CSV files (.csv.gz)
  - **Organization**: By trading pairs (AAVE-ETH, ETH-USDT, WBTC-ETH, LINK-ETH)
  - **Variables**: 
    - `timestamp`: Trade execution timestamp
    - `price`: Trade price in token units
    - `volume`: Trade volume
    - `isBuy`: Boolean indicating buy/sell direction
    - `gasPrice`: Gas price paid for transaction
    - `blockNumber`: Ethereum block number
    - `transactionHash`: Unique transaction identifier
  - **Time Period**: November 18, 2020 to August 4, 2021

- **`binance-trades/`**: Centralized exchange trade data from Binance
  - **Format**: Daily compressed CSV files (.csv.gz)
  - **Organization**: Same trading pairs as DEX data
  - **Variables**:
    - `timestamp`: Trade execution timestamp
    - `price`: Trade price
    - `quantity`: Trade quantity
    - `isBuyerMaker`: Boolean indicating whether the buyer was the maker (if true, the trade was seller-initiated)
    - `tradeId`: Unique trade identifier
  - **Time Period**: November 18, 2020 to August 4, 2021
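
The daily `.csv.gz` files can be inspected without decompressing them to disk. The sketch below uses only the standard library; `read_daily_csv` is a hypothetical helper, and the choice of which columns to convert to numbers is left to the caller because the exact encoding of each column is not specified here. (`pandas.read_csv` also reads `.csv.gz` files directly, inferring the compression from the file extension.)

```python
import csv
import gzip


def read_daily_csv(path, float_columns=()):
    """Read one gzip-compressed CSV file into a list of dicts.

    Columns named in float_columns are converted to float; all other
    values are left as the strings found in the file.
    """
    with gzip.open(path, mode="rt", newline="") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        for column in float_columns:
            row[column] = float(row[column])
    return rows
```

For a Binance file one might call `read_daily_csv(path, float_columns=("price", "quantity"))`; for a DEX file, `("price", "volume")`.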

#### Liquidity Data
- **`orderbook-cex/`**: Centralized exchange order book data
  - **Format**: Compressed CSV files (.csv.gz)
  - **Variables**:
    - `timestamp`: Order book snapshot timestamp
    - `bidPrice`, `askPrice`: Best bid and ask prices
    - `bidSize`, `askSize`: Best bid and ask sizes
    - `midPrice`: Mid-quote price
  - **Frequency**: Second-level snapshots

- **`liquidity-dex/`**: DEX liquidity pool data
  - **Format**: Compressed CSV files (.csv.gz)
  - **Variables**:
    - `timestamp`: Pool state timestamp
    - `reserve0`, `reserve1`: Token reserves in pool
    - `totalSupply`: Total liquidity token supply
    - `price`: Pool price (reserve1/reserve0)
  - **Frequency**: Block-level updates
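
To illustrate how the reserve variables relate to prices, the sketch below computes the pool price as `reserve1/reserve0` and the price change caused by a swap under the standard Uniswap V2 constant-product rule with a 0.3% fee. It is a textbook illustration under those assumptions, not necessarily the price impact measure computed by the package's scripts; all function names are hypothetical.

```python
def pool_price(reserve0, reserve1):
    """Marginal pool price of token0 in units of token1 (reserve1 / reserve0)."""
    return reserve1 / reserve0


def amount_out(amount_in, reserve_in, reserve_out, fee=0.003):
    """Tokens received from a constant-product pool for a given input amount."""
    amount_in_after_fee = amount_in * (1 - fee)
    return amount_in_after_fee * reserve_out / (reserve_in + amount_in_after_fee)


def price_impact(amount0_in, reserve0, reserve1, fee=0.003):
    """Relative change in the pool price after selling amount0_in of token0."""
    before = pool_price(reserve0, reserve1)
    received = amount_out(amount0_in, reserve0, reserve1, fee)
    # In Uniswap V2 the full input, including the fee, stays in the pool.
    after = pool_price(reserve0 + amount0_in, reserve1 - received)
    return after / before - 1


# Hypothetical pool: 1,000 units of token0 against 2,000,000 units of token1.
print(pool_price(1_000, 2_000_000))
print(price_impact(10, 1_000, 2_000_000))
```

Selling token0 into the pool raises `reserve0` and lowers `reserve1`, so the reported impact is negative.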

#### Blockchain Data
- **`blockchain-transactions/`**: Ethereum blockchain transaction data
  - **Format**: Compressed CSV files (.csv.gz)
  - **Variables**:
    - `blockNumber`: Ethereum block number
    - `transactionHash`: Transaction hash
    - `gasPrice`: Gas price in Gwei
    - `gasUsed`: Gas consumed
    - `timestamp`: Block timestamp
    - `from`, `to`: Transaction sender and receiver addresses
  - **Coverage**: All transactions in analyzed blocks

- **`blockchain-info/`**: Ethereum block-level information
  - **Format**: Compressed CSV files (.csv.gz)
  - **Variables**:
    - `number`: Block number
    - `baseFeePerGas`: Base fee per gas for the block (in Wei, converted to Gwei in processing)
  - **Coverage**: Block-level gas fee information for analyzed blocks

- **`mempool-orders/`**: Mempool transaction data
  - **Format**: Compressed CSV files (.csv.gz)
  - **Variables**:
    - `timestamp`: Mempool entry timestamp
    - `transactionHash`: Transaction hash
    - `gasPrice`: Gas price
    - `status`: Transaction status (pending/confirmed)
  - **Coverage**: Pending transactions before block inclusion
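
Since `baseFeePerGas` is stored in Wei while `gasPrice` in `blockchain-transactions/` is in Gwei, the two must be put on the same scale before they are compared. One Gwei equals 10^9 Wei. A minimal sketch of the conversion (the helper name `wei_to_gwei` is illustrative):

```python
WEI_PER_GWEI = 1e9  # 1 Gwei = 10**9 Wei


def wei_to_gwei(value_wei):
    """Convert a Wei amount (number or numeric string) to Gwei."""
    return float(value_wei) / WEI_PER_GWEI


print(wei_to_gwei(25_000_000_000))  # 25.0
```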

#### Additional Data
- **`sandwich-attacks/`**: Identified sandwich attack transactions
  - **Format**: CSV files (.csv)
  - **Variables**: Attack transaction details, victim transactions, profit calculations

- **`pool-volume/`**: Historical pool volume data
  - **Format**: CSV files (.csv)
  - **Variables**: 
    - `timestamp`: Date/timestamp
    - `volumeTotalUSD`: Total trading volume in USD
    - `tradesTotal`: Total number of trades
  - **File naming**: `pool-volume_{pair_name}_{exchange}.csv`

- **`asset-ranking_{date}.csv`**: Token ranking data for pair selection
  - **Format**: CSV files (.csv)
  - **Variables**: 
    - `symbol`: Token symbol
    - `address`: Token contract address
  - **File naming**: `asset-ranking_{date}.csv` (e.g., `asset-ranking_2020-11-18.csv`)
  - **Usage**: Used to select top-ranked tokens for pair selection

- **`avail-pairs_uniswapv2.csv.gz`**: Available trading pairs on Uniswap V2
  - **Format**: Compressed CSV file (.csv.gz)
  - **Variables**:
    - `pairName`: Pair name (e.g., "AAVE_WETH")
    - `baseAddress`: Base token contract address
    - `quoteAddress`: Quote token contract address
  - **Usage**: Used to identify pairs available on Uniswap V2

- **`avail-pairs_binance.csv`**: Available trading pairs on Binance
  - **Format**: CSV file (.csv)
  - **Variables**:
    - `pairName`: Pair name in Binance format
  - **Usage**: Used to match Uniswap pairs with Binance pairs

- **`dex-pair-info.csv`**: DEX pair metadata information
  - **Format**: CSV file (.csv)
  - **Variables**:
    - `pairName`: Pair name
    - `baseAddress`: Base token contract address
  - **Usage**: Used to map pair names to token addresses for liquidity data matching

## Script Descriptions

### Master Scripts

#### `master.py`
**Purpose**: Orchestrates the data processing and analysis pipeline using parallel processing
- **Input**: Depends on which steps are enabled (see individual script descriptions)
- **Output**: Depends on which steps are enabled (see individual script descriptions)
- **Key parameters**: `include_01` through `include_08` control different processing steps
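
The flag-driven design can be pictured as follows. This is a hypothetical sketch, not the contents of `master.py`: the mapping from each `include_*` flag to a script is inferred from the step order in the "Replication Workflow" section and the script descriptions, and `enabled_scripts` is an illustrative name.

```python
# Hypothetical mirror of the include_* switches described for master.py;
# the flag-to-script mapping is inferred from the "Replication Workflow" section.
STEPS = [
    ("include_01", "reshape-dex-trades.py"),
    ("include_02", "merge-trades-orders.py"),
    ("include_03", "classify-gas-price.py"),
    ("include_04", "identify-arb-trades.py"),
    ("include_05", "calculate-trade-flows.py"),
    ("include_06", "estimate-svar-tradeflow.py"),
    ("include_07", "calculate-trade-price-impact.py"),
    ("include_08", "calculate-trade-vlm.py"),
]


def enabled_scripts(flags):
    """Return the scripts whose include_* flag is True, in pipeline order."""
    return [script for flag, script in STEPS if flags.get(flag, False)]


flags = {"include_01": True, "include_02": True}
print(enabled_scripts(flags))  # ['reshape-dex-trades.py', 'merge-trades-orders.py']
```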

### Data Processing Scripts

#### `reshape-dex-trades.py`
**Purpose**: Reshapes and standardizes DEX trade data
- **Input**: `input/trades-aggregated-dex/`, `input/blockchain-transactions/`, `input/blockchain-info/`
- **Output**: `temp/trades-reshaped-dex/`

#### `merge-trades-orders.py`
**Purpose**: Merges executed trades with mempool orders
- **Input**: `temp/trades-reshaped-dex/`, `input/mempool-orders/`
- **Output**: `temp/mempool-orders-matched/`

#### `classify-gas-price.py`
**Purpose**: Classifies transactions by gas price levels
- **Input**: `temp/mempool-orders-matched/`
- **Output**: `temp/mempool-orders-classified/`

#### `calculate-trade-flows.py`
**Purpose**: Calculates aggregated trade flows for SVAR analysis
- **Input**: `temp/mempool-orders-classified/`, `temp/timestamp/`, `input/binance-trades/`
- **Output**: `temp/aggregated-trade-flow/`
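
As a rough illustration of what an aggregated trade flow is, the sketch below sums signed volume (positive for buyer-initiated trades, negative for seller-initiated ones) within fixed time buckets. The bucket length, the tuple layout, and the function names are assumptions made for this example; the script's actual definition may differ, for instance by splitting flows by gas price class. For Binance data, a trade is buyer-initiated when `isBuyerMaker` is false.

```python
from collections import defaultdict


def signed_volume(volume, is_buy):
    """Positive for buyer-initiated trades, negative for seller-initiated ones."""
    return volume if is_buy else -volume


def aggregate_trade_flow(trades, interval_seconds):
    """Sum signed volume within fixed-length time buckets.

    trades: iterable of (unix_timestamp_seconds, volume, is_buy) tuples.
    Returns {bucket_start_timestamp: net_signed_volume}.
    """
    flows = defaultdict(float)
    for timestamp, volume, is_buy in trades:
        bucket = int(timestamp // interval_seconds) * interval_seconds
        flows[bucket] += signed_volume(volume, is_buy)
    return dict(flows)


example = [(0, 2.0, True), (5, 1.0, False), (12, 3.0, True)]
print(aggregate_trade_flow(example, interval_seconds=10))  # {0: 1.0, 10: 3.0}
```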

#### `calculate-trade-price-impact.py`
**Purpose**: Computes price impact measures for trades
- **Input**: `temp/mempool-orders-classified/`, `input/liquidity-dex/`
- **Output**: `temp/trade-price-impact/`

#### `calculate-trade-vlm.py`
**Purpose**: Calculates aggregated trade volumes, complementing the trade flows
- **Input**: `temp/mempool-orders-classified/`, `input/binance-trades/`
- **Output**: `temp/aggregated-vlm/`

#### `estimate-svar-tradeflow.py`
**Purpose**: Estimates structural VAR models for trade flow analysis
- **Input**: `temp/aggregated-trade-flow/`
- **Output**: `temp/structural-var-tradeflow/`

### Analysis Scripts

#### `select-pair.py`
**Purpose**: Utility script for selecting trading pairs
- **Input**: `input/asset-ranking_{date}.csv`, `input/avail-pairs_uniswapv2.csv.gz`, `input/avail-pairs_binance.csv`
- **Output**: `temp/initial-sample-pairs_{ranking_threshold}.csv`

#### `select-sample.py`
**Purpose**: Selects sample data for analysis
- **Input**: `temp/initial-sample-pairs_{ranking_threshold}.csv`
- **Output**: `output/adv.pdf`

#### `test-trade-price-impact.py`
**Purpose**: Tests and analyzes trade price impact patterns
- **Input**: `temp/trade-price-impact/`, `temp/trade-size-quantiles.pickle`
- **Output**: `output/trade-price-impacts-reg-table_all-specs_*.tex`, `output/trade-price-impacts-reg-table_final*.tex`

#### `analyze-svar-bounds-flashbots.py`
**Purpose**: Analyzes SVAR bounds and generates impulse response functions
- **Input**: `temp/structural-var-tradeflow/`
- **Output**: `output/info-shr_bounds_*.tex`, `output/ppi_bounds_*.tex`, `output/return-irfs_bounds_*.pdf`

#### `plot-gas-informed-traders.py`
**Purpose**: Generates plots comparing gas prices for informed vs uninformed traders
- **Input**: `temp/trade-price-impact/`, `temp/trade-size-quantiles.pickle`
- **Output**: `output/gas-price_informed-vs-uninformed_bar-chart.pdf`

#### `analyze-block-position-informed-traders.py`
**Purpose**: Analyzes block position effects on informed trading
- **Input**: `temp/trade-price-impact/`, `temp/trade-size-quantiles.pickle`
- **Output**: `output/gas-price-block-position_informed-vs-uninformed_bar-chart_rp.pdf`

#### `analyze-price-impact-informed-traders.py`
**Purpose**: Analyzes price impact patterns of informed traders
- **Input**: `temp/trade-price-impact/`, `temp/trade-size-quantiles.pickle`
- **Output**: `output/gas-price-informed-address-reg-table_final.tex`, `output/price-impact-informed-address-reg-table_final.tex`

#### `test-informed-trader-gas.py`
**Purpose**: Tests the informed trader gas hypothesis
- **Input**: `temp/trade-price-impact/`, `temp/timestamp.csv.gz`, `temp/block-gas-stats.csv.gz`
- **Output**: `output/panel-regression-marginal-gas.tex`

#### `analyze-private-tx-stats.py`
**Purpose**: Analyzes private transaction statistics
- **Input**: `temp/private-tx-stats/`
- **Output**: `output/frac-private-vlm.pdf`

#### `analyze-explicit-competition.py`
**Purpose**: Analyzes explicit competition between traders
- **Input**: `temp/mempool-orders-matched/`, `temp/timestamp/`
- **Output**: `output/frac-non-pga-trades-excessive-gas.tex`

### Summary Statistics Scripts

#### `compute-summary-stats-trade.py`
**Purpose**: Computes summary statistics for trade data
- **Input**: `temp/mempool-orders-matched/`
- **Output**: `output/summary-stats_trade.tex`

#### `compute-summary-stats-liquidity.py`
**Purpose**: Computes summary statistics for liquidity data
- **Input**: `temp/mempool-orders-matched/`, `input/binance-trades/`
- **Output**: `output/summary-stats_liquidity.tex`

#### `compute-summary-stats-trade-flow.py`
**Purpose**: Computes summary statistics for trade flows
- **Input**: `temp/aggregated-trade-flow/`
- **Output**: `output/summary-stats_trade-flow.tex`

#### `compute-summary-stats-vlm.py`
**Purpose**: Computes summary statistics for volume data
- **Input**: `temp/aggregated-vlm/`
- **Output**: `output/summary-stats_vlm.tex`

### Specialized Analysis Scripts

#### `identify-arb-trades.py`
**Purpose**: Identifies arbitrage trades between DEX and CEX
- **Input**: `temp/trades-reshaped-dex/`, `input/orderbook-cex/`, `input/liquidity-dex/`, `input/dex-pair-info.csv`, `temp/sandwich-attacks-flagged/`
- **Output**: `temp/trades-arb/`

#### `flag-private-sandwich-attacks.py`
**Purpose**: Identifies sandwich attack transactions
- **Input**: `input/sandwich-attacks/`, `input/mempool-orders/`
- **Output**: `temp/sandwich-attacks-flagged/`

#### `calculate-arb-trade-stats.py`
**Purpose**: Calculates statistics for arbitrage trades
- **Input**: `temp/trades-reshaped-dex/`, `temp/trades-arb/`
- **Output**: `output/stats-arb.tex`

#### `calculate-private-tx-stats.py`
**Purpose**: Calculates private transaction statistics
- **Input**: `temp/mempool-orders-matched/`
- **Output**: `temp/private-tx-stats/`

#### `get-block-timestamp.py`
**Purpose**: Retrieves block timestamp information
- **Input**: `input/blockchain-transactions/blockchain-transactions_{date}.csv.gz`
- **Output**: `temp/timestamp/timestamp_{date}.csv.gz`

#### `fit-gas-tx-size.py`
**Purpose**: Fits gas price to transaction size relationships
- **Input**: `temp/mempool-orders-matched/`
- **Output**: `temp/trade-size-quantiles.pickle`, `temp/gas-stats_tx-size.pickle`

#### `calculate-block-gas-statistics.py`
**Purpose**: Calculates block-level gas statistics
- **Input**: `input/blockchain-transactions/blockchain-transactions_{date}.csv.gz`
- **Output**: `temp/block-gas-stats.csv.gz`
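
Because several scripts read one file per day (`..._{date}.csv.gz`), it can be useful to check that no day is missing before a long run. The sketch below assumes that `{date}` is formatted as `YYYY-MM-DD`, as in the `asset-ranking_2020-11-18.csv` example, and that the files cover the sample period given in the data description; the helper names are hypothetical.

```python
from datetime import date, timedelta
from pathlib import Path


def daily_dates(start, end):
    """Yield every calendar date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def missing_daily_files(directory, prefix, start, end):
    """List expected '{prefix}_{YYYY-MM-DD}.csv.gz' files absent from directory."""
    directory = Path(directory)
    expected = (f"{prefix}_{day.isoformat()}.csv.gz" for day in daily_dates(start, end))
    return [name for name in expected if not (directory / name).exists()]
```

For example, `missing_daily_files("input/blockchain-transactions", "blockchain-transactions", date(2020, 11, 18), date(2021, 8, 4))` would check all 260 calendar days of the sample period.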


### Collection and Aggregation Scripts

#### `collect-svar.py`
**Purpose**: Collects and aggregates SVAR estimation results
- **Input**: `temp/structural-var-tradeflow/`
- **Output**: `temp/structural-var-tradeflow/`

#### `collect-trade-price-impact.py`
**Purpose**: Collects and aggregates trade price impact results
- **Input**: `temp/trade-price-impact/`
- **Output**: `temp/trade-price-impact/`

### Utility Scripts

#### `plot_functions.py`
**Purpose**: Contains plotting and visualization functions

#### `install-r-packages.py`
**Purpose**: Installs required R packages

## Replication Workflow

### Step 1: Environment Setup
1. Install the Python dependencies listed under "Python Environment"
2. Install R and required packages:
   ```bash
   python install-r-packages.py
   ```

### Step 2: Data Processing Pipeline
Execute the data processing steps in order:

1. **Reshape DEX trades**:
   ```bash
   # Edit master.py to set include_01 = True
   python master.py
   ```

2. **Merge trades with orders**:
   ```bash
   # Set include_02 = True
   python master.py
   ```

3. **Classify gas prices**:
   ```bash
   # Set include_03 = True
   python master.py
   ```
   Note: before running step 3, first run the scripts listed under "Specialized Analysis Scripts" to generate the required intermediate data.

4. **Identify arbitrage trades**:
   ```bash
   # Set include_04 = True
   python master.py
   ```

5. **Calculate trade flows**:
   ```bash
   # Set include_05 = True
   python master.py
   ```

6. **Estimate the SVAR**:
   ```bash
   # Set include_06 = True
   python master.py
   ```

7. **Calculate price impact**:
   ```bash
   # Set include_07 = True
   python master.py
   ```

8. **Calculate trade volumes**:
   ```bash
   # Set include_08 = True
   python master.py
   ```
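
Some steps read intermediate files produced by scripts outside the eight steps above (for example, `temp/timestamp/` comes from `get-block-timestamp.py`). A valid execution order can be derived from the inputs and outputs listed in the script descriptions. The sketch below transcribes a subset of those `temp/` dependencies and orders the scripts with a depth-first topological sort; the README's input lists may be incomplete, so treat the result as a guide rather than a definitive schedule.

```python
# Inputs and outputs under temp/, transcribed from the script descriptions
# (input/ directories are omitted because they ship with the package).
SCRIPTS = {
    "reshape-dex-trades.py": {"needs": [], "makes": ["trades-reshaped-dex"]},
    "merge-trades-orders.py": {"needs": ["trades-reshaped-dex"], "makes": ["mempool-orders-matched"]},
    "classify-gas-price.py": {"needs": ["mempool-orders-matched"], "makes": ["mempool-orders-classified"]},
    "flag-private-sandwich-attacks.py": {"needs": [], "makes": ["sandwich-attacks-flagged"]},
    "identify-arb-trades.py": {"needs": ["trades-reshaped-dex", "sandwich-attacks-flagged"], "makes": ["trades-arb"]},
    "get-block-timestamp.py": {"needs": [], "makes": ["timestamp"]},
    "calculate-trade-flows.py": {"needs": ["mempool-orders-classified", "timestamp"], "makes": ["aggregated-trade-flow"]},
    "estimate-svar-tradeflow.py": {"needs": ["aggregated-trade-flow"], "makes": ["structural-var-tradeflow"]},
}


def execution_order(scripts):
    """Order scripts so each runs after the producers of its inputs."""
    producer = {out: name for name, io in scripts.items() for out in io["makes"]}
    order, visiting = [], set()

    def visit(name):
        if name in order:
            return
        if name in visiting:
            raise ValueError(f"circular dependency involving {name}")
        visiting.add(name)
        for needed in scripts[name]["needs"]:
            visit(producer[needed])
        visiting.discard(name)
        order.append(name)

    for name in scripts:
        visit(name)
    return order


for position, script in enumerate(execution_order(SCRIPTS), start=1):
    print(position, script)
```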

### Step 3: Collect Results
Aggregate the SVAR and trade price impact results:

```bash
python collect-svar.py
```

```bash
python collect-trade-price-impact.py
```

### Step 4: Generate Summary Statistics
Compute summary statistics:

```bash
python compute-summary-stats-trade.py          # Output: summary-stats_trade.tex
python compute-summary-stats-liquidity.py      # Output: summary-stats_liquidity.tex
python compute-summary-stats-trade-flow.py     # Output: summary-stats_trade-flow.tex
python compute-summary-stats-vlm.py            # Output: summary-stats_vlm.tex
```

### Step 5: Generate Final Analysis
Run final analysis scripts:

```bash
python test-informed-trader-gas.py
python analyze-svar-bounds-flashbots.py
python analyze-block-position-informed-traders.py
python analyze-price-impact-informed-traders.py
python analyze-explicit-competition.py
```

## Output Files

The `output/` directory contains the final results:

### Tables (`.tex` files)
- `summary-stats_trade.tex`: Trade-level summary statistics
- `summary-stats_liquidity.tex`: Liquidity summary statistics
- `summary-stats_trade-flow.tex`: Trade flow summary statistics
- `summary-stats_vlm.tex`: Trade volume summary statistics
- `stats-arb.tex`: Arbitrage trade statistics
- `gas-price-informed-address-reg-table_final.tex`: Gas price and block position regression results
- `price-impact-informed-address-reg-table_final.tex`: Price impact regression results
- `trade-price-impacts-reg-table_final.tex`: Trade price impact results
- `panel-regression-marginal-gas.tex`: Marginal gas regression results
- `frac-non-pga-trades-excessive-gas.tex`: Fraction of non-PGA trades with excessive gas
- Various SVAR bounds tables (`info-shr_bounds_*.tex`, `ppi_bounds_*.tex`)

### Figures (`.pdf` files)
- `frac-private-vlm.pdf`: Fraction of private transaction volume 
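- `gas-price_informed-vs-uninformed_bar-chart.pdf`: Gas prices of trades from informed vs. uninformed traders
- `gas-price-block-position_informed-vs-uninformed_bar-chart_rp.pdf`: Gas price and block position of trades from informed vs. uninformed traders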
- `return-irfs_bounds_*.pdf`: Return impulse response functions
- `adv.pdf`: Average daily trading volume figure for various tokens
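
After a full run, one way to confirm that the pipeline completed is to check that each expected file exists in `output/`. The sketch below uses a subset of the file names and glob patterns from the "Output" entries of the script descriptions; `missing_outputs` is a hypothetical helper, and the list can be extended as needed.

```python
from pathlib import Path

# File names and glob patterns taken from the "Output" entries of the
# script descriptions (a subset, for illustration).
EXPECTED_OUTPUTS = [
    "summary-stats_trade.tex",
    "summary-stats_liquidity.tex",
    "summary-stats_trade-flow.tex",
    "summary-stats_vlm.tex",
    "stats-arb.tex",
    "panel-regression-marginal-gas.tex",
    "frac-non-pga-trades-excessive-gas.tex",
    "info-shr_bounds_*.tex",
    "ppi_bounds_*.tex",
    "return-irfs_bounds_*.pdf",
    "frac-private-vlm.pdf",
    "adv.pdf",
]


def missing_outputs(output_dir, patterns=EXPECTED_OUTPUTS):
    """Return the patterns that match no file in output_dir."""
    output_dir = Path(output_dir)
    return [p for p in patterns if not any(output_dir.glob(p))]
```

Calling `missing_outputs("output")` returns an empty list when every listed file or pattern is present.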

## License and Usage

The replication package is provided under the MIT License to permit replication by independent researchers. The package includes:

- Complete data processing pipeline
- All analysis code and scripts
- Documentation for reproduction
- Intermediate and final results